52 research outputs found
Design and Implementation of MPICH2 over InfiniBand with RDMA Support
For several years, MPI has been the de facto standard for writing parallel
applications. One of the most popular MPI implementations is MPICH. Its
successor, MPICH2, features a completely new design that provides more
performance and flexibility. To ensure portability, it has a hierarchical
structure based on which porting can be done at different levels. In this
paper, we present our experiences designing and implementing MPICH2 over
InfiniBand. Because of its high performance and open standard, InfiniBand is
gaining popularity in the area of high-performance computing. Our study focuses
on optimizing the performance of MPI-1 functions in MPICH2. One of our
objectives is to exploit Remote Direct Memory Access (RDMA) in Infiniband to
achieve high performance. We have based our design on the RDMA Channel
interface provided by MPICH2, which encapsulates architecture-dependent
communication functionalities into a very small set of functions. Starting with
a basic design, we apply different optimizations and also propose a
zero-copy-based design. We characterize the impact of our optimizations and
designs using microbenchmarks. We have also performed an application-level
evaluation using the NAS Parallel Benchmarks. Our optimized MPICH2
implementation achieves 7.6 s latency and 857 MB/s bandwidth, which are
close to the raw performance of the underlying InfiniBand layer. Our study
shows that the RDMA Channel interface in MPICH2 provides a simple, yet
powerful, abstraction that enables implementations with high performance by
exploiting RDMA operations in InfiniBand. To the best of our knowledge, this is
the first high-performance design and implementation of MPICH2 on InfiniBand
using RDMA support.Comment: 12 pages, 17 figure
NewMadeleine: An Efficient Support for High-Performance Networks in MPICH2
International audienceThis paper describes how the NewMadeleine communication library has been integrated within the MPICH2 MPI implementation and the benefits brought. NewMadeleine is integrated as a Nemesis network module but the upper layers and in particular the CH3 layer has been modified. By doing so, we allow NewMadeleine to fully deliver its performance to an MPI application. NewMadeleine features sophisticated strategies for sending messages and natively supports multirail network configurations, even heterogeneous ones. It also uses a software element called PIOMan that uses multithreading in order to enhance reactivity and create more efficient progress engines. We show various results that prove that NewMadeleine is indeed well suited as a low-level communication library for building MPI implementations
Cache-Efficient, Intranode, Large-Message MPI Communication with MPICH2-Nemesis
International audienceThe emergence of multicore processors raises the need to efficiently transfer large amounts of data between local processes. MPICH2 is a highly portable MPI implementation whose large-message communication schemes suffer from high CPU utilization and cache pollution because of the use of a double-buffering strategy, common to many MPI implementations. We introduce two strategies offering a kernel-assisted, single-copy model with support for noncontiguous and asynchronous transfers. The first one uses the now widely available vmsplice Linux system call; the second one further improves performance thanks to a custom kernel module called KNEM. The latter also offers I/OAT copy offload, which is dynamically enabled depending on both hardware cache characteristics and message size. These new solutions outperform the standard transfer method in the MPICH2 implementation when no cache is shared between the processing cores or when very large messages are being transferred. Collective communication operations show a dramatic improvement, and the IS NAS parallel benchmark shows a 25% speedup and better cache efficiency
Efficient Intranode Communication in GPU-Accelerated Systems
Abstract—Accelerator awareness has become a pressing issue in data movement models, such as MPI, because of the rapid deployment of systems that utilize accelerators. In our previous work, we developed techniques to enhance MPI with accelerator awareness, thus allowing applications to easily and efficiently communicate data between accelerator memories. In this paper, we extend this work with techniques to perform efficient data movement between accelerators within the same node using a DMA-assisted, peer-to-peer intranode communication technique that was recently introduced for NVIDIA GPUs. We present a detailed design of our new approach to intranode communication and evaluate its improvement to communication and application performance using micro-kernel benchmarks and a 2D stencil application kernel. I
MPI + MPI: a new hybrid approach to parallel programming with MPI plus shared memory
Hybrid parallel programming with the message passing interface (MPI) for internode communication in conjunction with a shared-memory programming model to manage intranode parallelism has become a dominant approach to scalable parallel programming. While this model provides a great deal of flexibility and performance potential, it saddles programmers with the complexity of utilizing two parallel programming systems in the same application. We introduce an MPI-integrated shared-memory programming model that is incorporated into MPI through a small extension to the one-sided communication interface. We discuss the integration of this interface with the MPI 3.0 one-sided semantics and describe solutions for providing portable and efficient data sharing, atomic operations, and memory consistency. We describe an implementation of the new interface in the MPICH2 and Open MPI implementations and demonstrate an average performance improvement of 40% to the communication component of a five-point stencil solve
Understanding the requirements imposed by programming model middleware on a common communication subsystem
Abstract. In high-performance parallel computing, most programming model middleware libraries and runtime systems use a communication subsystem to abstract the lower level network layer. The functionality required of a communication subsystem depends largely on the particular programming model implemented by the middleware. In order to maximize performance, middleware libraries and runtime systems typically implement their own communication subsystems that are specially tuned for the middleware, rather than use an existing communication subsystem. This leads to duplicated effort and prevents different middleware libraries from being used by the same application in hybrid programming models. In this paper we describe features required by various middleware libraries as well as some desirable features that would make it easier to port a middleware library to the communication subsystem, and allow the middleware to make use of high-performance features provided by some networking layers. We evaluate whether existing communication subsystems support these features efficiently. We show that none of the existing communication subsystems that we evaluated support all of the features.
Designing a common communication subsystem
Abstract. Communication subsystems are used in high-performance parallel computing systems to abstract the lower network layer. By using a communication subsystem, an upper middleware library or runtime system can be more easily ported to different interconnects. By abstracting the network layer, however, the designer typically makes the communication subsystem more specialized for that particular middleware library, making it ineffective for supporting middleware for other programming models. In previous work we analyzed the requirements of various programming-model middleware and the communication subsystems that support such requirements. We found that although there are no mutually exclusive requirements, none of the existing communication subsystems can efficiently support the programming model middleware we considered. In this paper, we describe our design of a common communication subsystem, called CCS, that can efficiently support various programming model middleware.
- …